feat(mlx-grpc): support string stop sequences for chat and completion #1447

zach-li-sudo wants to merge 9 commits into
Conversation
📝 Walkthrough

This PR adds MLX stop-string handling: encode user stop strings to token IDs during request build, store stop token IDs in MLX sampling params, and resolve MLX matched-stop token IDs back to user-facing values during response processing; includes tests and a failing tokenizer mock.

Changes

MLX Stop-Sequence Support
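The "failing tokenizer mock" mentioned in the walkthrough is a test utility for exercising the encode-error path. A minimal sketch of what such a mock can look like — the struct and method names here are illustrative, not the PR's actual crates/tokenizer/src/mock.rs code:

```rust
/// Test-only tokenizer whose encode always fails, useful for checking that
/// stop-string resolution degrades gracefully when encoding errors occur.
pub struct FailingTokenizer;

impl FailingTokenizer {
    pub fn encode(&self, _text: &str) -> Result<Vec<u32>, String> {
        Err("mock tokenizer: encode always fails".to_string())
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn encode_always_errors() {
        assert!(FailingTokenizer.encode("6").is_err());
    }
}
```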
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b4ce20db09
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@model_gateway/src/routers/grpc/common/stages/helpers.rs`:
- Around line 83-99: The code calls resolve_mlx_stop_ids(stop, tokenizer) before
verifying the request is the MLX variant, which can tokenize unnecessarily and
produce spurious errors; change the order so you first match on
ProtoGenerateRequest::Mlx (e.g., if let ProtoGenerateRequest::Mlx(req) =
proto_request { ... }), bail early if not MLX, then check for Some(stop) and
only then call resolve_mlx_stop_ids(stop, tokenizer) and extend
sampling.stop_token_ids; ensure you still return Ok(()) when stop is None or
when sampling_params is missing.
In `@model_gateway/src/routers/grpc/proto_wrapper.rs`:
- Around line 741-747: The MLX variant's matched_stop_json() can return raw
integer token IDs; update the five unguarded call sites
(process_non_streaming_generate_response,
process_non_streaming_messages_response, process_non_streaming_chat_response
(harmony), the Harmony streaming response processing, and the Harmony streaming
variant) so they do not consume raw matched_stop_json() directly: either guard
with is_mlx() and call resolve_mlx_matched_stop_json() for Mlx, or always call
resolve_mlx_matched_stop_json() (which uses mlx_matched_stop_token_id()) before
using the value; ensure any code paths that previously read matched_stop_json()
now receive the resolved string form.
In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 422-453: Update the docstring for stop_strings_to_token_ids to
document that tokenizer.encode(...) errors are not propagated but are logged and
the corresponding stop string is skipped (i.e., both zero-token encodings and
encoder errors are warn-and-skipped), and clarify that only multi-token
encodings produce an Err result; reference the function name
stop_strings_to_token_ids and the call tokenizer.encode to make clear where this
behavior occurs.
- Around line 496-508: apply_mlx_stop_sequences currently extends
sampling.stop_token_ids with values converted by resolve_mlx_stop_ids without
checking if the request already set explicit stop_token_ids; add a validation
that rejects the request when both stop strings and explicit stop_token_ids are
provided. To fix, change apply_mlx_stop_sequences (or its caller) to receive the
original request's stop_token_ids (or a boolean flag) and if that vector is
non-empty and stop_strings is present, return a bad_request error; alternatively
perform this check in the caller before invoking apply_mlx_stop_sequences.
Ensure the error originates from the same error::bad_request pattern and
reference sampling.stop_token_ids, apply_mlx_stop_sequences, and
resolve_mlx_stop_ids in your change.
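A hedged sketch of the resolve-before-use guard the proto_wrapper.rs comment asks for. The wrapper enum and context type below are stand-ins; only the intent comes from the comment: token IDs that originated as user stop strings resolve back to strings, while explicit stop token IDs stay numeric (as the manual tests later in this thread also show — matched_stop: "6" versus matched_stop: 20):

```rust
use std::collections::HashMap;
use serde_json::Value;

/// Stand-in for the request-side table built when stop strings were encoded:
/// stop token ID -> original user-facing stop string.
struct StopContext {
    id_to_stop: HashMap<u32, String>,
}

enum ResponseWrapper {
    Mlx { matched_stop_token_id: Option<u32> },
    Other { matched_stop: Option<Value> },
}

impl ResponseWrapper {
    /// Resolved accessor: MLX raw token IDs are mapped back to the stop
    /// string the user sent; IDs the user supplied directly stay numeric.
    fn resolved_matched_stop(&self, ctx: &StopContext) -> Option<Value> {
        match self {
            ResponseWrapper::Mlx { matched_stop_token_id } => matched_stop_token_id
                .map(|id| match ctx.id_to_stop.get(&id) {
                    Some(s) => Value::String(s.clone()),
                    None => Value::Number(id.into()),
                }),
            ResponseWrapper::Other { matched_stop } => matched_stop.clone(),
        }
    }
}
```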
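For the chat_utils.rs docstring comment, a sketch of the contract being documented. stop_strings_to_token_ids is the PR's function name; the signature and body here are simplified stand-ins:

```rust
/// Encode user stop strings into single stop-token IDs.
///
/// Error handling (the behavior the review asks to document):
/// - tokenizer encode errors are NOT propagated: the failure is logged and
///   that stop string is skipped;
/// - zero-token encodings are likewise warn-and-skipped;
/// - only multi-token encodings produce an Err, since a single stop token ID
///   cannot represent them.
fn stop_strings_to_token_ids(
    stops: &[String],
    encode: impl Fn(&str) -> Result<Vec<u32>, String>, // stand-in for tokenizer.encode
) -> Result<Vec<u32>, String> {
    let mut ids = Vec::new();
    for s in stops {
        match encode(s) {
            Err(e) => eprintln!("warn: skipping stop {s:?}: {e}"), // logged, not propagated
            Ok(t) if t.is_empty() => eprintln!("warn: skipping stop {s:?}: empty encoding"),
            Ok(t) if t.len() == 1 => ids.push(t[0]),
            Ok(t) => {
                return Err(format!(
                    "stop string {s:?} encodes to {} tokens; only single-token stop strings are supported",
                    t.len()
                ))
            }
        }
    }
    Ok(ids)
}
```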
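And a sketch of the validation the second chat_utils.rs comment proposes. Note this reflects the reviewer's suggestion, not merged behavior — the manual tests later in this thread show combined inputs being accepted:

```rust
/// Reviewer-proposed guard: reject requests that set both string stops and
/// explicit stop_token_ids, since both funnel into the same
/// sampling.stop_token_ids list and the combination is ambiguous.
fn validate_stop_inputs(
    stop_strings: Option<&[String]>,
    explicit_stop_token_ids: &[u32],
) -> Result<(), String> {
    let has_strings = stop_strings.map_or(false, |s| !s.is_empty());
    if has_strings && !explicit_stop_token_ids.is_empty() {
        // Mirrors the error::bad_request pattern the comment references.
        return Err("bad_request: `stop` strings and explicit `stop_token_ids` cannot be combined".into());
    }
    Ok(())
}
```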
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8297696e-30a4-4e03-9896-ce3f2cb893d9
📒 Files selected for processing (13)
Cargo.toml
crates/grpc_client/src/mlx_engine.rs
crates/protocols/src/completion.rs
crates/tokenizer/src/mock.rs
model_gateway/Cargo.toml
model_gateway/src/routers/grpc/common/stages/helpers.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/chat_utils.rs
model_gateway/src/routers/grpc/utils/mod.rs
Thanks for the contribution, Zhuo! The feature itself is small and the tests are nice, but the diff feels heavier than it needs to be because backend-type branching is leaking into orchestration code. A few things to consider before merge:
Hi @zach-li-sudo, the DCO sign-off check has failed. All commits must include a `Signed-off-by` line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c7e1728cf1
♻️ Duplicate comments (1)
model_gateway/src/routers/grpc/common/stages/helpers.rs (1)
283-305: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Optimize by checking MLX variant before tokenization.

The current order (check stop, tokenize, then check MLX variant) can waste CPU tokenizing stops for non-MLX backends. Checking the MLX variant first avoids unnecessary tokenization:

♻️ Proposed reordering

```diff
 pub(crate) fn apply_mlx_stop_sequences(
     proto_request: &mut ProtoGenerateRequest,
     stop: Option<&StringOrArray>,
     tokenizer: Option<&dyn Tokenizer>,
 ) -> Result<(), Response> {
+    let ProtoGenerateRequest::Mlx(req) = proto_request else {
+        return Ok(());
+    };
     let Some(stop) = stop else {
         return Ok(());
     };
-
-    if let ProtoGenerateRequest::Mlx(req) = proto_request {
-        let token_ids = resolve_mlx_stop_ids(stop, tokenizer)?;
-        let sampling = req.sampling_params.as_mut().ok_or_else(|| {
-            error::internal_error(
-                "mlx_sampling_params_missing",
-                "MLX GenerateRequest has no sampling_params; cannot inject stop IDs",
-            )
-        })?;
-        sampling.stop_token_ids.extend(token_ids);
-    }
-
-    Ok(())
+    let token_ids = resolve_mlx_stop_ids(stop, tokenizer)?;
+    let sampling = req.sampling_params.as_mut().ok_or_else(|| {
+        error::internal_error(
+            "mlx_sampling_params_missing",
+            "MLX GenerateRequest has no sampling_params; cannot inject stop IDs",
+        )
+    })?;
+    sampling.stop_token_ids.extend(token_ids);
+    Ok(())
 }
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@model_gateway/src/routers/grpc/common/stages/helpers.rs` around lines 283-305: apply_mlx_stop_sequences currently tokenizes stop sequences via resolve_mlx_stop_ids before confirming the proto_request is the MLX variant, causing unnecessary CPU work for non-MLX backends; modify apply_mlx_stop_sequences to first early-return if proto_request is not ProtoGenerateRequest::Mlx, then check the stop Option and call resolve_mlx_stop_ids only inside the MLX branch, and finally mutate sampling_params (sampling.stop_token_ids.extend(...)) as before to avoid wasted tokenization for non-MLX requests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@model_gateway/src/routers/grpc/common/stages/helpers.rs`:
- Around line 283-305: The function apply_mlx_stop_sequences currently tokenizes
stop sequences via resolve_mlx_stop_ids before confirming the proto_request is
the MLX variant, causing unnecessary CPU work for non-MLX backends; modify
apply_mlx_stop_sequences to first early-return if proto_request is not
ProtoGenerateRequest::Mlx, then proceed to check the stop Option and call
resolve_mlx_stop_ids only when inside the MLX branch, and finally mutate
sampling_params (sampling.stop_token_ids.extend(...)) as before to avoid wasted
tokenization for non-MLX requests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a1ca32dd-17bc-490f-9af2-dc65c5b82612
📒 Files selected for processing (11)
crates/grpc_client/src/mlx_engine.rs
crates/protocols/src/completion.rs
crates/tokenizer/src/mock.rs
model_gateway/src/routers/grpc/common/stages/helpers.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/chat_utils.rs
model_gateway/src/routers/grpc/utils/mod.rs
…lightseekorg#1099) Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
…d no-ops on non-MLX Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fac846fbd6
Manual testing results: stop sequence with MLX backend

1. SMG with MLX backend setup
pip install -e grpc_servicer/
source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
--model mlx-community/Qwen3-4B-Instruct-2507-4bit --port 50051
./target/debug/smg --worker-urls grpc://localhost:50051 --port 3000

2. Testing scenarios

This PR supports string stop arrays in the regular chat and completion pipelines.

Key differences from vLLM
The gateway converts single-token stop strings to token IDs for MLX requests (the MLX proto accepts stop token IDs but has no string stop field).

Token ID reference

Token IDs used in the tests below assume the Qwen3 tokenizer, which shares vocabulary with Qwen2.5:
Verify with the tokenizer if results are unexpected:

from mlx_lm import load
_, tokenizer = load("mlx-community/Qwen3-4B-Instruct-2507-4bit")
print(tokenizer.encode("5 6"))  # check IDs for context-free "5" and "6"

3. Results

Test matrix: 4 paths × 5 stop modes = 20 cases.
3.1 Chat, non-stream

3.1.1 — stop string, single token

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "chatcmpl-019e12f3-af39-7bc1-a46c-4173bebe0cbc",
"object": "chat.completion",
"created": 1778434420,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1 \n2 \n3 \n4 \n5 \n",
"reasoning_content": null
},
"finish_reason": "stop",
"matched_stop": "6"
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 11,
"total_tokens": 32
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.1.2 — stop string, multi-token

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
"stop": ["hello world"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.1.3 — stop_token_ids

curl http://localhost:3000/v1/chat/completions -s \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq

{
"id": "chatcmpl-019e12f3-b06a-7e30-9026-b277ef4cd022",
"object": "chat.completion",
"created": 1778434420,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1 \n2 \n3 \n4 \n",
"reasoning_content": null
},
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 9,
"total_tokens": 30
},
"system_fingerprint": "default"
}

3.1.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stop_token_ids": [20],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "chatcmpl-019e13c9-0ea1-77c0-b430-bf53fbcdede2",
"object": "chat.completion",
"created": 1778448404,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1 \n2 \n3 \n4 \n",
"reasoning_content": null
},
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 9,
"total_tokens": 30
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.1.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.2 Chat, stream

3.2.1 — stop string, single token

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stream": true,
"max_tokens": 100
}'

3.2.2 — stop string, multi-token

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
"stop": ["hello world"],
"stream": true,
"max_tokens": 100
}'

3.2.3 — stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -N \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'

3.2.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stop_token_ids": [20],
"stream": true,
"max_tokens": 100
}'

3.2.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'

3.3 Completion, non-stream

3.3.1 — stop string, single token

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "cmpl_019e12f4-9eda-71c2-bf57-ab5fca3036f6",
"object": "text_completion",
"created": 1778434481,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"text": "5\n",
"index": 0,
"finish_reason": "stop",
"matched_stop": "6"
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 3,
"total_tokens": 25
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.3.2 — stop string, multi-token

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Repeat exactly: 1 2 3 hello world 4 5",
"stop": ["hello world"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.3.3 — stop_token_ids

curl http://localhost:3000/v1/completions -s \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq

{
"id": "cmpl_019e12f4-9faf-7582-bcc0-bb7849cf6585",
"object": "text_completion",
"created": 1778434482,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"text": "",
"index": 0,
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 1,
"total_tokens": 23
},
"system_fingerprint": "default"
}
3.3.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stop_token_ids": [20],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "cmpl_019e13c9-6957-7f10-8038-53f7c215dd87",
"object": "text_completion",
"created": 1778448427,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"text": "",
"index": 0,
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 1,
"total_tokens": 23
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.3.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.4 Completion, stream

3.4.1 — stop string, single token

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stream": true,
"max_tokens": 100
}'

3.4.2 — stop string, multi-token

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Repeat exactly: 1 2 3 hello world 4 5",
"stop": ["hello world"],
"stream": true,
"max_tokens": 100
}'

3.4.3 — stop_token_ids

curl http://localhost:3000/v1/completions -s -N \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'

3.4.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stop_token_ids": [20],
"stream": true,
"max_tokens": 100
}'

3.4.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'
Hi Keyang, thanks for your comments! The issues you've mentioned are resolved as follows:
Other items:
Solution: used
Yes, the overall test plan was executed on my local Mac with real responses; all test cases are summarized in the test matrix above. I see some other gaps (not directly related to this PR) in MLX backend support, like the Harmony paths and the messages/generate endpoints. I will discuss these after further investigation.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ddd72c071
```rust
    .matched_stop_token_id
    .map(|id| serde_json::Value::Number(id.into())),
// MLX requires request context to resolve the token ID; use matched_stop_json_with_context.
Self::Mlx(_) => unreachable!("matched_stop_json called for MLX backend"),
```
Keep MLX matched-stop lookup non-panicking for Harmony
Changing matched_stop_json() to unreachable!() for Self::Mlx now makes active Harmony MLX paths crash at runtime, because those flows still call the old method (model_gateway/src/routers/grpc/harmony/processor.rs:62 and .../harmony/streaming.rs:299) instead of matched_stop_json_with_context. Any Harmony request routed to MLX that reaches a Complete frame will panic rather than returning a response, so this should remain non-panicking until all Harmony call sites are migrated.
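A minimal sketch of the non-panicking shape Codex is asking for: the MLX arm surfaces the raw token ID (or None) instead of unreachable!(), so un-migrated Harmony call sites degrade to numeric matched_stop values rather than crashing. The free function below is illustrative, not the PR's actual method:

```rust
use serde_json::Value;

/// Non-panicking fallback for the MLX arm: return the raw token ID instead
/// of panicking, so Harmony paths that still call the context-free accessor
/// keep working until they migrate to matched_stop_json_with_context.
fn mlx_matched_stop_json(matched_stop_token_id: Option<u32>) -> Option<Value> {
    matched_stop_token_id.map(|id| Value::Number(id.into()))
}

fn main() {
    // A legacy Harmony call site gets Some(20) rather than a panic.
    assert_eq!(
        mlx_matched_stop_json(Some(20)),
        Some(Value::Number(20u32.into()))
    );
}
```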
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/grpc_client/src/mlx_engine.rs`:
- Around line 245-246: Update the inline comment that currently says "Messages
and Generate pipelines still reject string stops" to list all three pipelines
that reject string stops (Messages, Generate, and Responses). Locate the comment
near the top of mlx_engine.rs (the block describing stop-sequence support) and
amend the sentence to explicitly include "Responses" alongside "Messages" and
"Generate", and ensure it references the existing reject_stop_strings check used
in the Responses handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 34dfd186-7722-4fc6-a55b-bb7feb896de4
📒 Files selected for processing (11)
crates/grpc_client/src/mlx_engine.rs
crates/protocols/src/completion.rs
crates/tokenizer/src/mock.rs
model_gateway/src/routers/grpc/common/stages/helpers.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/chat_utils.rs
model_gateway/src/routers/grpc/utils/mod.rs
```rust
// - String stop sequences: supported in chat and completion pipelines.
//   Messages and Generate pipelines still reject string stops (see reject_stop_strings).
```
Update the comment to mention all three pipelines that still reject string stops.
The comment states that "Messages and Generate pipelines still reject string stops," but the Responses pipeline (line 422) also retains the reject_stop_strings check. For completeness, the comment should list all three.
📝 Proposed fix to make the documentation complete
- // - String stop sequences: supported in chat and completion pipelines.
- // Messages and Generate pipelines still reject string stops (see reject_stop_strings).
+ // - String stop sequences: supported in chat and completion pipelines.
+ // Messages, Generate, and Responses pipelines still reject string stops (see reject_stop_strings).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/grpc_client/src/mlx_engine.rs` around lines 245 - 246, Update the
inline comment that currently says "Messages and Generate pipelines still reject
string stops" to list all three pipelines that reject string stops (Messages,
Generate, and Responses). Locate the comment near the top of mlx_engine.rs (the
block describing stop-sequence support) and amend the sentence to explicitly
include "Responses" alongside "Messages" and "Generate", and ensure it
references the existing reject_stop_strings check used in the Responses
handling.
Description
Follow-up on #1099 to support chat/completion with the `stop` field.

Problem

Support string stop sequences for chat and completion.
Solution
Convert stop strings to stop token IDs before passing them to the MLX backend.
Changes
Test Plan
See the detailed curl commands and real responses in the manual testing comment above.
Checklist
- cargo +nightly fmt passes
- cargo clippy --all-targets --all-features -- -D warnings passes

Summary by CodeRabbit
New Features
Improvements